2024-01-16
Our web site: https://thomaselove.github.io/432-2024/
Visit the Calendar at the top of the page, which will take you to the Class 01 README page.
Just about everything is linked at https://thomaselove.github.io/432-2024
Every deliverable is listed in the Calendar.
Assignments include two projects, eight labs, ten minute papers and two quizzes.
Project A (publicly available data: linear & logistic models)
Project B (use almost any data and build specific models)
Eight labs, meant to be (generally) shorter than 431 Labs
Lab 8 can be done at any time, and involves building (or augmenting) a website for yourself.
The Syllabus and the Lab Instructions provide details on feedback.
We WELCOME questions/comments/corrections/thoughts!
All of our TAs return from working with students in 431 this past Fall, and I couldn't be more grateful for their energy and effort. Learn more about the TAs in the Syllabus.
TA Zoom Office Hours begin this Friday 2024-01-19. Details coming soon to Canvas, our website, and our Shared Google Drive.
Some source materials are password-protected. What is the password?
“A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible….
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.”
Read Sections 3 (Data transformation) and 5 (Data tidying)
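The "each variable is a column, each observation is a row" idea in the quote above can be sketched with tidyr's pivot_longer(); the wide data below are made up for illustration and are not course data:

```r
library(tidyr)

# Hypothetical untidy data: one column per year (a tidy-data violation)
untidy <- data.frame(
  country = c("A", "B"),
  `2023`  = c(10, 20),
  `2024`  = c(12, 25),
  check.names = FALSE
)

# Tidy version: each country-year combination becomes its own row
tidy_version <- untidy |>
  pivot_longer(cols = c(`2023`, `2024`),
               names_to = "year", values_to = "cases")

tidy_version
```

After pivoting, one row holds one observation (a country in a year), so tools like ggplot2 and dplyr can work on `year` and `cases` directly.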
We want:
clean_names() from the janitor package to turn everything into snake_case.
Jenny Bryan's advice on "Naming Things" holds up well. There's a full presentation at SpeakerDeck.
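A quick sketch of what clean_names() does (the column names below are invented for illustration):

```r
library(janitor)

# Awkward names: spaces, punctuation, mixed case
d <- data.frame(`Bill Length (mm)` = c(39.1, 39.5, 40.3),
                `Species Name`     = "Adelie",
                check.names = FALSE)

# clean_names() converts everything to snake_case
d <- clean_names(d)
names(d)
```

The parentheses and spaces are stripped, yielding names like `bill_length_mm` that are easy to type and to compute on.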
Good file names:
Avoid: spaces, punctuation, accented characters, case sensitivity
Deliberately use delimiters to make things easy to compute on and make it easy to recover meta-data from the filenames.
Don’t spend a lot of time bemoaning or cleaning up past ills. Strive to improve this sort of thing going forward.
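As a sketch of the delimiter idea: with "_" separating fields and "-" used within a field, base R can recover the meta-data from a file name (the file name here is hypothetical):

```r
# Hypothetical file name following the "_" between fields, "-" within
# a field convention
fname <- "2024-01-16_class01_bill-length-notes.qmd"

# Drop the extension, then split on the field delimiter
parts <- strsplit(tools::file_path_sans_ext(fname), "_")[[1]]

part_date  <- parts[1]   # date of the session
part_class <- parts[2]   # which class meeting
part_topic <- parts[3]   # topic slug
```

Because the delimiters are used deliberately, a whole directory of such names can be parsed the same way.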
https://quarto.org/ is the main website for Quarto.
If you can write an R Markdown file, it will also work in Quarto: just switch the extension from .Rmd to .qmd.
All material for this course is written using Quarto.
Splitting our_tibble into training/test samples
We will place 60% of the penguins in our training sample, and require that similar fractions of each species occur in our training and testing samples. We use functions from the rsample package here.
We could use slice_sample() as in the Course Notes if we didn’t stratify by species.
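A sketch of the stratified split described above, using rsample's initial_split(); the seed and the use of na.omit(penguins) as our_tibble are assumptions for illustration:

```r
library(rsample)
library(palmerpenguins)

# Stand-in for our_tibble: complete cases of the penguins data
our_tibble <- na.omit(penguins)

set.seed(432)  # assumed seed, for reproducibility

# 60% training, stratified so each species appears in similar
# proportions in the training and testing samples
our_split <- initial_split(our_tibble, prop = 0.60, strata = species)
our_train <- training(our_split)
our_test  <- testing(our_split)
```

Stratifying by species is what keeps the species percentages nearly identical across the two samples, as the tables below show.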
Training sample:

| species | n | percent |
|---|---|---|
| Adelie | 87 | 43.9% |
| Chinstrap | 40 | 20.2% |
| Gentoo | 71 | 35.9% |
| Total | 198 | 100.0% |
Test sample:

| species | n | percent |
|---|---|---|
| Adelie | 59 | 43.7% |
| Chinstrap | 28 | 20.7% |
| Gentoo | 48 | 35.6% |
| Total | 135 | 100.0% |
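One way to build a count/percent table like those above (a sketch, not necessarily the course's code) uses dplyr's count(); the stand-in data below reproduce the training-sample counts:

```r
library(dplyr)

# Stand-in for our_train: 87 Adelie, 40 Chinstrap, 71 Gentoo penguins
demo <- data.frame(species = c(rep("Adelie", 87),
                               rep("Chinstrap", 40),
                               rep("Gentoo", 71)))

# Count each species and express the counts as percentages
tab <- demo |>
  count(species) |>
  mutate(percent = round(100 * n / sum(n), 1))

tab
```

The janitor package's tabyl() offers a similar one-step alternative.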
ggplot(data = our_train,
       aes(x = species, y = bill_length_mm)) +
  geom_violin(aes(fill = species)) +
  geom_boxplot(width = 0.3, notch = TRUE) +
  stat_summary(fill = "purple", fun = "mean",
               geom = "point",
               shape = 23, size = 3) +
  facet_wrap(~ sex) +
  guides(fill = "none") +
  labs(title = "Bill Length, by Species, faceted by Sex",
       subtitle =
         glue(nrow(our_train), " of the Palmer Penguins"),
       x = "Species", y = "Bill Length (in mm)")

Analysis of Variance Table
Response: bill_length_mm
           Df Sum Sq Mean Sq F value    Pr(>F)
species     2 3962.6 1981.32  353.74 < 2.2e-16 ***
sex         1  587.3  587.31  104.86 < 2.2e-16 ***
Residuals 194 1086.6    5.60
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
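The ANOVA table above corresponds to a model with both species and sex as predictors. A sketch of that fit (the our_train built here is just na.omit(penguins), a stand-in rather than the actual 60% training split):

```r
library(palmerpenguins)

# Stand-in for the training sample used in the slides
our_train <- na.omit(penguins)

# m1: bill length as a function of species and sex
m1 <- lm(bill_length_mm ~ species + sex, data = our_train)

# Sequential (Type I) sums of squares, as in the table above
anova(m1)
```

Because both penguins species and sex have tiny p-values, each predictor appears to add meaningful explanatory value here.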
m2 <- lm(bill_length_mm ~ species, data = our_train)
## anova(m2) yields p-value < 2.2e-16 (not shown here)
tidy(m2, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, conf.low, conf.high) |>
  kable(digits = 1)

| term | estimate | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | 39.2 | 38.7 | 39.7 |
| speciesChinstrap | 9.8 | 8.8 | 10.7 |
| speciesGentoo | 8.5 | 7.7 | 9.3 |
bind_rows(glance(m1), glance(m2)) |>
mutate(model = c("m1 (species & sex)",
"m2 (species only)")) |>
select(model, r2 = r.squared, adjr2 = adj.r.squared,
AIC, BIC, sigma, nobs) |>
  kable(digits = c(0, 3, 3, 1, 1, 2, 0))

| model | r2 | adjr2 | AIC | BIC | sigma | nobs |
|---|---|---|---|---|---|---|
| m1 (species & sex) | 0.807 | 0.804 | 909.0 | 925.4 | 2.37 | 198 |
| m2 (species only) | 0.703 | 0.700 | 992.6 | 1005.7 | 2.93 | 198 |
Which model has better in-sample performance?
m1_aug <- augment(m1, newdata = our_test)
m1_res <- m1_aug |>
summarize(val_R_sq = cor(bill_length_mm, .fitted)^2,
MAPE = mean(abs(.resid)),
RMSPE = sqrt(mean(.resid^2)),
max_Error = max(abs(.resid)))
m2_aug <- augment(m2, newdata = our_test)
m2_res <- m2_aug |>
summarize(val_R_sq = cor(bill_length_mm, .fitted)^2,
MAPE = mean(abs(.resid)),
RMSPE = sqrt(mean(.resid^2)),
max_Error = max(abs(.resid)))bind_rows(m1_res, m2_res) |>
mutate(model = c("m1 (species & sex)",
"m2 (species only)")) |>
relocate(model) |>
  kable(digits = c(0, 3, 2, 2, 1))

| model | val_R_sq | MAPE | RMSPE | max_Error |
|---|---|---|---|---|
| m1 (species & sex) | 0.841 | 1.75 | 2.29 | 6.6 |
| m2 (species only) | 0.718 | 2.54 | 3.06 | 8.2 |
Which model predicts better in the test sample?
Fit model m1 in the training sample; evaluate the quality of fit.
Fit model m2 in the training sample; evaluate the quality of fit.